ABSTRACT

The recent development of multiprocessor systems that offer
resistance to faults by gracefully degrading after a failure opens vast
new ranges of applications for fault tolerance and high reliability. This
paper presents a general model for the evaluation of such systems. It
takes into account the internal structure of the hardware, the
characteristics of the various detection mechanisms, the unreliability of
the software and even the type of applications these systems are used for.
It provides many measures of the systems' performance, such as availability,
mean time between crashes, average processing power and proportion of time
spent in degraded mode. System optimization gives the best values for the
number of processors, memories, ..., and shows the trade-offs between
hardware and software fault-detection mechanisms. The model is illustrated
by a concrete example.

Index Terms: Digital systems, fault tolerance, graceful degradation,
availability, hardware unreliability, software unreliability,
Markov chains.
I. INTRODUCTION

Most of the past work on reliability modeling for digital systems
has centered on evaluating the efficiency of redundancy techniques
for achieving ultra-reliability. Triple Modular Redundancy [1-3],
hybrid [4-7], stand-by [8-10], self-purging [11-13] and duplex [14, 15]
redundancy techniques have been the focus of much attention. Recently,
software tools have been developed [16, 17]. Most of these studies are
directed towards reliability evaluation of computer systems intended for
aeronautic and space applications, where the cost of computer failure is
so high as to allow the use of massive redundancy. However, there are many
other applications, like banking and airline reservation systems, where
massive redundancy techniques are economically unattractive but where
continuous service is still a must. In the last few years there have been
several attempts to provide general-purpose multiprocessor computers with
high reliability and availability by capitalizing on the inherent redundancy
of multiprocessors [18, 19]. They can provide uninterrupted service if
each occurrence of a failure is detected and followed by a reconfiguration
of the complete system such that some service is still available from the
reconfigured system. This philosophy, to trade some computing power for
continuous operation, is usually called graceful degradation [18].

Most of the reliability models for ultra-reliable systems are not
well suited for the study of commercial systems that use graceful
degradation. Ultra-reliable systems, like those used in airplanes or
spaceships, are intended and designed to perform a mission of known length
without any system failure. In-flight repair is usually impossible
and interruption of service for as short a time as a few minutes is
usually unacceptable (e.g., if the interruption occurs during a
computer-controlled automatic landing). In contrast, commercial applications
put more emphasis on availability. A system like an airline reservation
system needs to be up most of the time. Interruptions of service are
acceptable as long as they are quite short (a few minutes) and slower
response can also be tolerated for brief periods. Commercial systems also
differ substantially from ultra-reliable systems in their architecture.
Commercial systems for safe computing are gracefully degradable
multiprocessors with large software, while ultra-reliable systems are in
general specialized uniprocessors, protected from failures by external
redundancy, and with very limited software.

This paper presents a general model for performance evaluation of
gracefully degradable systems. The model is general enough to be
applied to systems as different as Pluribus [20], C.mmp [21] or PRIME
[22]. It differs substantially from the models developed by Borgerson
[18] and Hayes [23]. Availability, down-time per system crash, percentage
of time in degraded mode and overhead due to recovery and reconfiguration
are analyzed. Both hardware and software unreliability are taken into
account. Analysis of sensitivity to such parameters as the frequency of
software tests leads to fine system tuning. Optimization completes the
study.
II. DESCRIPTION OF GRACEFULLY DEGRADABLE SYSTEMS
II.1 GENERAL DESCRIPTION

Computing systems can be defined in broad terms as a set of
hardware resources managed by software to provide users with some
specific service.

The hardware is composed of several resources: processing, memories
for storage, I/Os for communication with the outside world, and an
intercommunication network (bus, crossbar switch, etc.) to provide
interconnection between the elements. Each type of resource may contain
several elements that provide the same kind of service; for example, there
may be several CPUs and memory modules. The intercommunication network may
also have some inherent redundancy (redundant busses, alternate disjoint
paths, etc.).

Software is a major resource in any large computing system, both in
terms of cost and as a source of failures [24]. Operating systems are
responsible for management of the system so as to provide the best use and
sharing of the resources among the users. In gracefully degradable
systems, software is also responsible for fault diagnosis, recovery and
hardware reconfiguration. Utility software, like compilers, file systems and
text editors, is oriented more towards aiding the users than overall
system management.

The services provided by computing systems are quite varied, from
control of automatic processes up to the broad variety of services
provided by large time-sharing systems serving several hundred users. The
type and variety of services provided significantly influence the
overall system performance, even with respect to reliability. Restricted
applications can be satisfied with relatively simple software, while some
computation centers have operating systems in excess of one hundred
thousand words with extremely complex internal organization (complicating
recovery and reconfiguration).

II.2 FAILURES

Failures can be defined, in broad terms, as any event that breaks the
regular system operation. Failures can be hardware malfunctions, software
errors or occurrences of a situation that the software cannot handle
(especially in complex real-time systems). In computer systems with large
software, software unreliability is a major cause of failures [24] and,
consequently, reliability models should take it into account.

One of the major characteristics of failures is their duration.
Transient failures correspond to malfunctions that disappear after a short
time (e.g., short, temporary hardware malfunctions during a burst of
electromagnetic radiation or deadlock situations that are solved by killing
one process). Permanent (also called hard or solid) failures correspond to
permanent malfunctions that necessitate physical repair (or debugging).

The extent of a failure corresponds to the range of the corresponding
malfunction. Failures can be catastrophic if the malfunction extends to
the whole system (e.g., a power supply failure in a system with a single
supply affects the operation of all the hardware). In general, the extent
of any failure is limited to one (or very few) elements (e.g., one CPU for
most of the CPU-related failures, one memory module for most memory
failures). It should be noted that the extent of a failure characterizes
the physical range of the malfunction and not the range of the failure
effects (e.g., a failure that occurs in a CPU may affect, if it goes
undetected for some time, the validity of the data in memory).

II.3 DETECTION, RECOVERY AND RECONFIGURATION

The basic idea behind graceful degradation is to detect the failures
as soon as they occur, determine their extent, logically disconnect the
faulty element(s), reconfigure the remaining system as a useful one, try
to recover from the failure effects on data integrity and resume normal
operation in a degraded mode until the malfunction is repaired.

Detection is achieved by hardware (codes [25], memory protection,
self-checking circuits [26], duplication of control mechanisms, time-out
counters, etc.), by software (periodic test of the hardware, retry
instructions, capability lists [27], performance monitors and schedulers for
such problems as deadlocks or infinite loops, etc.) or eventually by the
users and operators when all automatic detection methods have failed.
Failures can be differentiated, with respect to detection, into two classes.
The first class corresponds to the failures that are detected as soon as
they occur (or, more precisely, by the first error they cause; that is
to say, by the first deviation from correct operation). Failures of
the first class are safe because they are detected before their effects
propagate. The second class groups all other failures, which means all
failures that are not detected upon the occurrence of their first induced
error. These failures are unsafe, for they may cause data contamination or
faulty controls before they are detected. Following such failures, it is
necessary to assure the integrity of the data that have been contaminated
during the detection latency [28] (the time period between the first error
and the first detected error). This is of particular importance in
multi-access data base systems.

After a detection occurs, it is necessary to determine the extent of
the failure. Diagnostic programs and testing strategies can be used to
localize the faulty elements.

According  to  the  extent  of  failures  and  the  range  of  their  effects, 
it  is  possible  to  isolate  the  faulty  element(s)  and  reconfigure  the  system 
in  such  a way  that  some  useful  service  is  still  available.  Reconfiguration 
involves  such  steps  as  possible  relocation  of  the  operating  system  in  a 
fault-free  memory,  memory  address  remapping  to  isolate  a faulty  memory 
module  or  running  an  n-processor  system  as  an  n-1  processor  system. 

Recovery concerns the restoration of the overall system control,
the restoration of the integrity of data (files, data bases, etc.) and
the orderly restart of the system operation. The extent of the recovery
process depends heavily on the type of application. Systems oriented
towards multi-access data base transactions need to guarantee high
integrity of the data bases. Failures in telephone systems can result in
the loss of some calls as long as the recovery is fast enough (i.e., fast
enough to service the calls that are in the dialing stage). Restoration of
data integrity (including operating system tables) can be achieved by
roll-back and rerun, use of traces, updating from a redundant copy of the
contaminated data or program roll-aheads. Restoration of data integrity
following unsafe failures is complicated by the fact that detection and
diagnostic mechanisms may not differentiate between safe and unsafe
failures and that the detection latency of unsafe failures is unknown.

II.4 PERFORMANCE

The major concern underlying the use of gracefully degradable systems
is to provide continuous service even following occurrences of failures.
Availability (defined as the probability that the system is up), duration
of the unavailability periods, average computing power (taking into account
operation in degraded mode) and overhead due to recovery and
reconfiguration are some of the major parameters that can be used to
characterize gracefully degradable systems.

System failure can result from exhaustion of a resource (catastrophic
failures or series of failures too close for repair to be completed) or
from unsafe failures. During the detection latency of unsafe failures,
the system is running but produces faulty results (which require job rerun
after the failure has been detected). Duration of unavailability periods
is heavily dependent upon the type of system crash (resource exhaustion or
unsafe failures) and their associated parameters (repair time or detection
latency plus reconfiguration and recovery overhead).

The  average  computing  power  is  a function  of  the  time  spent  in  degraded 
mode,  of  repair  times  and  overhead  associated  with  recovery/reconfiguration. 
III. GENERAL MODELING
III.1 SYSTEM PARTITIONING

Gracefully degradable systems can be partitioned into several
independent resources. Each resource provides the system with a special
type of service and is constituted by one or several elements that are
functionally identical and that share the resource load. For example,
the PRIME system (Figure 1) can be partitioned into seven resources: the
resource for processing composed of five CPUs (CPU = processor plus
dedicated map and I/O controller), a fast memory resource formed of 13
memory modules, a secondary memory resource (disc drives), an I/O device
resource, two communication networks (the external access network and the
CPU-memory network) and a software resource (operating system).

For the system to be operable, a certain degree of performance is
required from each resource. This will be referred to as the minimum
configuration (e.g., one processor, one memory and some I/Os). Overall
system performance (computing power, availability, etc.) is directly
dependent upon the performance of each resource. So, by analyzing each
resource separately one can obtain a fairly accurate evaluation of the
overall system performance.

III.2 MODEL FOR THE RESOURCES

Some resources, like the memory and processing resources, have highly
parallel structures. They are formed by several identical, physically
independent elements and the amount of service that can be obtained from
such a resource is proportional to the number of fault-free elements.
However, resources like the interconnection networks (and software) lack
such a well-defined parallelism even though they may still have some
inherent redundancy (not every failure is catastrophic). Network redundancy
(number of paths between two nodes, minimal cuts, etc.) has been treated in
detail [29, 30]. Redundancy in software is not very common. So, one
may consider software as a resource with a single element for which
identical copies are readily available from disc (the presence of copies
does not correspond to redundancy, for every copy presents the same defects).
For the modeling of interconnection networks, one can find the average
decrease in the information flow due to a failure. The relative value of
the decrease gives an indication of the degree of parallelism and this can
be used to model real networks as several independent fictitious
subnetworks in parallel. For example, in the CPU-memory interconnection
network in PRIME, a pessimistic analysis shows that, on average, a failure
in the net disconnects one processor from the memories. So, one can
approximately model such a network as five independent fictitious
subnetworks in parallel. Each failure inside a subnetwork makes it totally
inoperative and the transfer rate on the subnetwork is one fifth of
that of the real network.
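
The mapping from an average flow decrease to a number of fictitious subnetworks can be sketched in a few lines. The function name and its arguments are illustrative assumptions, not notation from the model; the only input taken from the text is that a failure in the PRIME CPU-memory network removes, on average, one fifth of the information flow.

```python
def fictitious_subnetworks(relative_flow_loss, total_transfer_rate=1.0):
    """Approximate a network by identical, independent subnetworks in parallel.

    relative_flow_loss: average fraction of the information flow lost per
        failure (hypothetical input name; 1/5 for the PRIME example).
    Returns (number_of_subnetworks, transfer_rate_of_each_subnetwork).
    """
    if not 0.0 < relative_flow_loss <= 1.0:
        raise ValueError("relative_flow_loss must be in (0, 1]")
    count = round(1.0 / relative_flow_loss)
    return count, total_transfer_rate / count


count, rate = fictitious_subnetworks(1 / 5)
print(count, rate)
```

With a relative loss of one fifth, the sketch yields five subnetworks, each carrying one fifth of the total transfer rate, as in the PRIME example.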

In the following, one will assume that a resource is formed by n
identical and independent (failure-wise) elements. Out of these n elements,
m need to be fault-free for the system to be able to run. Element failures
are grouped into two classes according to detection. The failures that
are detected on the first error they cause are called safe. The other
failures are unsafe and characterized by the latency between the first error
and the first detected error. Following the detection of unsafe failures
(most of the failures detected by periodic test), it is necessary to
restore the integrity of the data that may have been contaminated. One will
assume that the time needed to restore system integrity is directly
proportional to the detection latency. The reasoning behind this assumption
is that it will be necessary to check every action the system has been
taking during the latency period. The proportionality coefficient, e (e for
extent of recovery), is strongly dependent upon the type of applications
the system is used for. Electronic reconfiguration of the hardware will
be considered to be instantaneous.

The elements have the following characteristics:

$\lambda$ = failure rate
$\mu$ = repair rate
$a$ = probability that a failure is transient
$c$ = probability that a failure is safe
$1/\nu_d$ = detection latency of the unsafe failures
$e$ = multiplicative factor giving the recovery time from the latency
$\nu$ = rate of complete recovery following an unsafe failure
(assumed to be a constant rate), obtained from $\nu_d$ and $e$

The state of a resource is fully defined by the number, x, of
fault-free elements, and the number, y, of elements which either have
undetected failures or are recovering from them. Such a state will be
referred to a
Fig. 2-b. Output transition probabilities from state S


Fig.  2.  Markov  chain  and  transition  probabilities. 



When m is larger than 1, the analytical solution for n* is quite complex.
However, if one expresses n* as a function of m, $n^* = (1+e)\,m$, one can
get an approximation by using the approximation between the binomial and

Table 1 lists a few values of n* as a function of the element
characteristics and the value of m. It is striking to note that
availability is maximum when the number of elements, n, is quite close to
the minimum, m. It is also worth noting that the optimum is largely
insensitive to the characteristics of the unsafe failures. The similarity
between these results and those concerning optimization of hybrid [9]
and stand-by [31] systems is quite interesting.
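
The existence of an optimum close to m can be illustrated with a deliberately crude calculation that is not the Markov model of this paper. The sketch assumes independent elements, approximates resource-exhaustion unavailability by the probability that fewer than m elements are up, and adds a first-order term for unsafe failures (rate of unsafe failures times the downtime each one causes). All numerical values, the function names and the variable names are hypothetical.

```python
from math import comb


def approx_unavailability(n, m, lam, mu, c, unsafe_downtime):
    """Crude unavailability of a resource with n elements, m of them required.

    lam, mu: element failure and repair rates (per hour).
    c: probability that a failure is safe.
    unsafe_downtime: hours lost per unsafe failure (latency plus recovery).
    Simplifying assumptions (not from the paper): independent elements,
    every failure needs repair, and the two contributions simply add.
    """
    up = mu / (lam + mu)          # steady-state probability an element is up
    down = 1.0 - up
    # Probability that fewer than m of the n elements are up.
    exhaustion = sum(comb(n, k) * up**k * down**(n - k) for k in range(m))
    # Each of the n elements produces unsafe failures at rate lam * (1 - c).
    unsafe = n * lam * (1.0 - c) * unsafe_downtime
    return exhaustion + unsafe


def best_element_count(m, max_n, **params):
    """Return the n in [m, max_n] with the smallest approximate unavailability."""
    return min(range(m, max_n + 1),
               key=lambda n: approx_unavailability(n, m, **params))


# Hypothetical element: 1000 h between failures, 1 h repair, c = 0.9,
# and about 2 minutes (0.03 h) lost per unsafe failure.
params = dict(lam=1e-3, mu=1.0, c=0.9, unsafe_downtime=0.03)
print(best_element_count(1, 6, **params))
```

With these assumed values the minimum falls at n = 2 for m = 1: a second element removes most of the exhaustion risk, while every further element mainly adds unsafe failures.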


The  response  throughput,  E,  can  also  be  maximized 
to  1 , the  optimal  n**  is 


which, as expected, is far larger than n*. However, the maximization of
the average throughput of an element, E/n, gives a result very close to
n*. As the number of elements is increased beyond n*, both the
availability and the average throughput of each element decrease.


IV.  EXAMPLE 


IV.1 DESCRIPTION

To illustrate the model, we will apply it to an example. The system
modeled will be a simplified version of the PRIME system [18, 22] (Fig. 1):
five processors, 13 memory modules, 15 discs, an external access network
modeled as ten independent switches, and some external devices (we will
assume six identical units). We will neglect the unreliability of the
processor-memory interconnection network. Because of the unavailability
of failure-related data concerning the software run on the PRIME system,
we will assume for this example that the system is running a widely used
operating system (e.g., a combination of OS/MVT, HASP plus some editors,
for which some failure-related data are available).

We will further assume that the hardware is equipped with some
error-detection mechanism (parity checking). Almost all the failures
detected by parity are detected on the first few errors, so one can assume
that all failures detected by hardware are safe. The system is also
periodically tested by software. Failures detected by software are likely to
be unsafe (they have bypassed the hardware detection mechanisms and are
detected only by extensive periodic tests). The probability, c, that a
failure is detected by the hardware fault-detection mechanism will be taken
as 0.90 (the use of codes should guarantee such a coverage).

The frequency of the software periodic tests will be 1 per minute.
Using the data relevant to test efficiency in [18], the conditional
probability that such a test detects a failure, given a failure, is 0.90.
This leads (assuming independence of successive tests) to an average
detection latency of 0.6 minutes. Assuming conservatively that recovery
takes twice as long, the average unavailability is around 2 minutes long.
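
The figure of 0.6 minutes can be checked by hand under two assumptions that the text does not state explicitly: the failure occurs at a uniformly distributed instant within a test period, and each test independently detects it with probability 0.90. The function and variable names are illustrative.

```python
def mean_detection_latency(test_period, detection_probability):
    """Mean time from failure to detection by periodic tests.

    Assumes (not stated in the text) that the failure instant is uniform
    within a test period and that successive tests are independent.
    """
    wait_for_first_test = test_period / 2.0
    # The number of missed tests before the detecting one is geometric,
    # with mean (1 - p) / p.
    missed_tests = (1.0 - detection_probability) / detection_probability
    return wait_for_first_test + missed_tests * test_period


latency = mean_detection_latency(test_period=1.0, detection_probability=0.90)
recovery = 2.0 * latency            # recovery assumed to take twice the latency
print(round(latency, 2), round(latency + recovery, 2))
```

Under these assumptions the latency is about 0.61 minutes and the total unavailability about 1.8 minutes, in line with the rounded values of 0.6 and 2 minutes.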

Data gathered during an eight-month period on an IBM 360/91 running
OS/MVT, HASP and the text editor "WYLBUR" indicate that the mean time
between software failures is fairly constant and runs around 70 hours. Two
thirds of the failures cause a system crash and necessitate restart
(approximately 15 minutes). The other third causes system degradation for
around 15 minutes. So, for the modeling of such software, one can make the
assumption that there are two software resources. The O.S. forms a resource
by itself (one element, $\lambda = 10^{-2}$/h, $\mu = 4$/h). The rest of the
software will be divided into two independent elements
($\lambda = 2.5 \times 10^{-3}$/h, $\mu = 4$/h, $c = 1$). A summary of the
system parameters is given in Table 2.
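
The software parameters follow from the measurements quoted above by simple arithmetic. The sketch below redoes that arithmetic; the variable names are illustrative, and the even split of the non-crashing failures between the two utility elements is an assumption made here for the calculation.

```python
MTBF_HOURS = 70.0            # mean time between software failures
CRASH_FRACTION = 2.0 / 3.0   # failures that crash the system
RESTART_MINUTES = 15.0       # restart (or degradation) duration

total_failure_rate = 1.0 / MTBF_HOURS
os_failure_rate = CRASH_FRACTION * total_failure_rate
# Assumption: the remaining failures are shared equally by two elements.
utility_failure_rate = (1.0 - CRASH_FRACTION) * total_failure_rate / 2.0
repair_rate = 60.0 / RESTART_MINUTES   # repairs per hour

print(f"{os_failure_rate:.4f} {utility_failure_rate:.4f} {repair_rate:.1f}")
```

The results, roughly 0.0095/h and 0.0024/h with a repair rate of 4/h, are of the same order as the rounded rates used for the two software resources.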

IV.2 RESULTS

The results of the study of this system can be summarized as follows:

• System availability = 99.74%.

• Software failures contribute 97% of all unavailability.

• Mean time between system crashes = 57.5 hours.

• Mean time between crashes due to hardware = 418 hours.

• Average down time per crash = 14 minutes.

• Average down time per hardware-caused crash ≤ 2 minutes.

• 17.4% of the time, the system operates in degraded mode.

• Less than 1% of all hardware unavailability is due to
resource exhaustion.

• The average processing power is 4.97 processors,
12.98 memories,
10.00 switches,
14.69 discs, and
5.51 output devices.

When the unreliability of the input/output devices is not taken into
account, the mean time between crashes due to hardware is around 1700 hours
and the availability is 0.99998. These results agree reasonably well with
those obtained by Borgerson [18]. So, the unreliability of the input/output
devices is the major factor limiting the hardware availability. It should
also be noted that not every failure in the input/output devices is safe.
For example, in the 360/67 system at Stanford, out of 61 line printer
failures, five caused system crashes.

It may be interesting to look at the effects of the hardware
detection mechanisms and the frequency of periodic tests on the availability
of the hardware (Figures 3 and 4).

The introduction of hardware detection mechanisms is extremely
beneficial, even if they are fairly inefficient (small values of c). But,
even with the best mechanisms, availability is limited. The frequency
with which software tests are run has the same effect (Figure 4).
Frequent testing is beneficial but, past a certain limit, the incremental
availability gain may not compensate for the increase in overhead.


Fig. 3. Variation of the unavailability as a function of the probability c.

Fig. 4. Variation of the unreliability as a function of the test period.


V. CONCLUSIONS


The previous example clearly shows the benefits of graceful
degradation at the hardware level. Lengthy downtimes due to repairs
are eliminated. Systems operate longer without interruption,
unavailability periods are quite short and overall processing power is not
much affected by failures. This type of smooth operation over indefinite
periods of time may be extremely well suited for a large range of
applications. However, gracefully degrading systems are as varied as their
applications, so that a universal tool for performance evaluation is highly
desirable.


The model presented here is based on a general description of the
principles behind graceful degradation. It is oriented towards
performance evaluation and not only reliability analysis. Classification of
failures into safe and unsafe allows one to model the different
methods of fault detection. Detection latencies are useful to assess
the damage that a failure can create before it is detected. The
importance accorded to data integrity is taken into account in modeling
the duration of the recovery processes. The partitioning of systems into
resources considerably simplifies the analysis and provides a way to take
software unreliability into account.


Study of the model shows that much attention should be given to
optimization. Indeed, by increasing the degree of redundancy beyond a
certain limit, some performance measures decrease; for example, availability
and the ratio of efficiency to cost. The model also points to the
practical interest of hardware fault-detection mechanisms and frequent
software tests and to some of the trade-offs.

One of the most interesting problems associated with gracefully
degrading systems is the structure of the software when it is considered
as a resource which can, and indeed does, suffer failures. It is hoped
that some investigation of this problem may be carried out in the near
future.


VI. 